Valley of Disappointment

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Survey

menti.com (7417 4931). https://www.menti.com/alt1s23cj7ie.

 
 
 
 
 
 
 
 
 
 

Introduction

Atomic vector numeric, letter

Challenge 1.2

Indexing.

Ordering elements.

Conditional subsetting.

Challenge 1.3

Function

Missing data

Challenge 1.4

  • Do you have data_raw directory?
## load the tidyverse packages, incl. dplyr
library(tidyverse)

When you load tidyverse:
- dplyr: collection of tools to easily manipulate the data
- tidyr: reshaping the data for plotting
- ggplot2: for advanced plotting
- tibble: modern dataframe

dplyr cheatsheet
tidyr cheatsheet

download.file(
  url = "https://nbisweden.github.io/module-r-intro-dm-practices/data/Hawks.csv",
  destfile = "data_raw/Hawks.csv"
)

read_csv from tidyverse package == read.csv from base R.

# silently read in CSV file with data
hawks <- read_csv("data_raw/Hawks.csv") #, col_names = T)
## Rows: 908 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): CaptureTime, BandNumber, Species, Age, Sex
## dbl  (13): Month, Day, Year, Wing, Weight, Culmen, Hallux, Tail, StandardTai...
## time  (1): ReleaseTime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hawks)
## spc_tbl_ [908 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Month       : num [1:908] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Day         : num [1:908] 19 22 23 23 27 28 28 29 29 30 ...
##  $ Year        : num [1:908] 1992 1992 1992 1992 1992 ...
##  $ CaptureTime : chr [1:908] "13:30" "10:30" "12:45" "10:50" ...
##  $ ReleaseTime : 'hms' num [1:908] NA NA NA NA ...
##   ..- attr(*, "units")= chr "secs"
##  $ BandNumber  : chr [1:908] "877-76317" "877-76318" "877-76319" "745-49508" ...
##  $ Species     : chr [1:908] "RT" "RT" "RT" "CH" ...
##  $ Age         : chr [1:908] "I" "I" "I" "I" ...
##  $ Sex         : chr [1:908] NA NA NA "F" ...
##  $ Wing        : num [1:908] 385 376 381 265 205 412 370 375 412 405 ...
##  $ Weight      : num [1:908] 920 930 990 470 170 1090 960 855 1210 1120 ...
##  $ Culmen      : num [1:908] 25.7 NA 26.7 18.7 12.5 28.5 25.3 27.2 29.3 26 ...
##  $ Hallux      : num [1:908] 30.1 NA 31.3 23.5 14.3 32.2 30.1 30 31.3 30.2 ...
##  $ Tail        : num [1:908] 219 221 235 220 157 230 212 243 210 238 ...
##  $ StandardTail: num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ Tarsus      : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ WingPitFat  : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ KeelFat     : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ Crop        : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Month = col_double(),
##   ..   Day = col_double(),
##   ..   Year = col_double(),
##   ..   CaptureTime = col_character(),
##   ..   ReleaseTime = col_time(format = ""),
##   ..   BandNumber = col_character(),
##   ..   Species = col_character(),
##   ..   Age = col_character(),
##   ..   Sex = col_character(),
##   ..   Wing = col_double(),
##   ..   Weight = col_double(),
##   ..   Culmen = col_double(),
##   ..   Hallux = col_double(),
##   ..   Tail = col_double(),
##   ..   StandardTail = col_double(),
##   ..   Tarsus = col_double(),
##   ..   WingPitFat = col_double(),
##   ..   KeelFat = col_double(),
##   ..   Crop = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

When you execute read_csv on a data file, it looks through the first 1000 rows of each column and guesses its data type. Use the function spec() to view the full column specification:

Dataframes

columns = vector (datatype)

Inspecting the dataframes

view(hawks) 
hawks
## # A tibble: 908 × 19
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    19  1992 13:30          NA       877-76317  RT      I     <NA> 
##  2     9    22  1992 10:30          NA       877-76318  RT      I     <NA> 
##  3     9    23  1992 12:45          NA       877-76319  RT      I     <NA> 
##  4     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  5     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  6     9    28  1992 11:25          NA       1207-55910 RT      I     <NA> 
##  7     9    28  1992 13:30          NA       877-76320  RT      I     <NA> 
##  8     9    29  1992 11:45          NA       877-76321  RT      A     <NA> 
##  9     9    29  1992 15:35          NA       877-76322  RT      A     <NA> 
## 10     9    30  1992 13:45          NA       1207-55911 RT      I     <NA> 
## # ℹ 898 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>
head(hawks)
## # A tibble: 6 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    19  1992 13:30          NA       877-76317  RT      I     <NA>    385
## 2     9    22  1992 10:30          NA       877-76318  RT      I     <NA>    376
## 3     9    23  1992 12:45          NA       877-76319  RT      I     <NA>    381
## 4     9    23  1992 10:50          NA       745-49508  CH      I     F       265
## 5     9    27  1992 11:15          NA       1253-98801 SS      I     F       205
## 6     9    28  1992 11:25          NA       1207-55910 RT      I     <NA>    412
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
tail(hawks)
## # A tibble: 6 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1    11    18  2003 14:07          NA       1207-53144 RT      I     <NA>    366
## 2    11    18  2003 14:44          NA       1177-04777 RT      I     <NA>    380
## 3    11    19  2003 10:18          NA       803-05985  SS      I     F       190
## 4    11    19  2003 12:02          NA       1807-53145 RT      I     <NA>    360
## 5    11    20  2003 9:56           NA       1177-04778 RT      I     <NA>    369
## 6    11    20  2003 13:30          NA       1207-53145 RT      A     <NA>    199
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
str(hawks)#structure
## spc_tbl_ [908 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Month       : num [1:908] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Day         : num [1:908] 19 22 23 23 27 28 28 29 29 30 ...
##  $ Year        : num [1:908] 1992 1992 1992 1992 1992 ...
##  $ CaptureTime : chr [1:908] "13:30" "10:30" "12:45" "10:50" ...
##  $ ReleaseTime : 'hms' num [1:908] NA NA NA NA ...
##   ..- attr(*, "units")= chr "secs"
##  $ BandNumber  : chr [1:908] "877-76317" "877-76318" "877-76319" "745-49508" ...
##  $ Species     : chr [1:908] "RT" "RT" "RT" "CH" ...
##  $ Age         : chr [1:908] "I" "I" "I" "I" ...
##  $ Sex         : chr [1:908] NA NA NA "F" ...
##  $ Wing        : num [1:908] 385 376 381 265 205 412 370 375 412 405 ...
##  $ Weight      : num [1:908] 920 930 990 470 170 1090 960 855 1210 1120 ...
##  $ Culmen      : num [1:908] 25.7 NA 26.7 18.7 12.5 28.5 25.3 27.2 29.3 26 ...
##  $ Hallux      : num [1:908] 30.1 NA 31.3 23.5 14.3 32.2 30.1 30 31.3 30.2 ...
##  $ Tail        : num [1:908] 219 221 235 220 157 230 212 243 210 238 ...
##  $ StandardTail: num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ Tarsus      : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ WingPitFat  : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ KeelFat     : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  $ Crop        : num [1:908] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Month = col_double(),
##   ..   Day = col_double(),
##   ..   Year = col_double(),
##   ..   CaptureTime = col_character(),
##   ..   ReleaseTime = col_time(format = ""),
##   ..   BandNumber = col_character(),
##   ..   Species = col_character(),
##   ..   Age = col_character(),
##   ..   Sex = col_character(),
##   ..   Wing = col_double(),
##   ..   Weight = col_double(),
##   ..   Culmen = col_double(),
##   ..   Hallux = col_double(),
##   ..   Tail = col_double(),
##   ..   StandardTail = col_double(),
##   ..   Tarsus = col_double(),
##   ..   WingPitFat = col_double(),
##   ..   KeelFat = col_double(),
##   ..   Crop = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
spec(hawks)
## cols(
##   Month = col_double(),
##   Day = col_double(),
##   Year = col_double(),
##   CaptureTime = col_character(),
##   ReleaseTime = col_time(format = ""),
##   BandNumber = col_character(),
##   Species = col_character(),
##   Age = col_character(),
##   Sex = col_character(),
##   Wing = col_double(),
##   Weight = col_double(),
##   Culmen = col_double(),
##   Hallux = col_double(),
##   Tail = col_double(),
##   StandardTail = col_double(),
##   Tarsus = col_double(),
##   WingPitFat = col_double(),
##   KeelFat = col_double(),
##   Crop = col_double()
## )
summary(hawks)
##      Month             Day             Year      CaptureTime       
##  Min.   : 8.000   Min.   : 1.00   Min.   :1992   Length:908        
##  1st Qu.: 9.000   1st Qu.: 9.00   1st Qu.:1995   Class :character  
##  Median :10.000   Median :16.00   Median :1999   Mode  :character  
##  Mean   : 9.843   Mean   :15.74   Mean   :1998                     
##  3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:2001                     
##  Max.   :11.000   Max.   :31.00   Max.   :2003                     
##                                                                    
##  ReleaseTime        BandNumber          Species              Age           
##  Length:908        Length:908         Length:908         Length:908        
##  Class1:hms        Class :character   Class :character   Class :character  
##  Class2:difftime   Mode  :character   Mode  :character   Mode  :character  
##  Mode  :numeric                                                            
##                                                                            
##                                                                            
##                                                                            
##      Sex                 Wing           Weight           Culmen    
##  Length:908         Min.   : 37.2   Min.   :  56.0   Min.   : 8.6  
##  Class :character   1st Qu.:202.0   1st Qu.: 185.0   1st Qu.:12.8  
##  Mode  :character   Median :370.0   Median : 970.0   Median :25.5  
##                     Mean   :315.6   Mean   : 772.1   Mean   :21.8  
##                     3rd Qu.:390.0   3rd Qu.:1120.0   3rd Qu.:27.3  
##                     Max.   :480.0   Max.   :2030.0   Max.   :39.2  
##                     NA's   :1       NA's   :10       NA's   :7     
##      Hallux            Tail        StandardTail       Tarsus     
##  Min.   :  9.50   Min.   :119.0   Min.   :115.0   Min.   :24.70  
##  1st Qu.: 15.10   1st Qu.:160.0   1st Qu.:162.0   1st Qu.:55.60  
##  Median : 29.40   Median :214.0   Median :215.0   Median :79.30  
##  Mean   : 26.41   Mean   :198.8   Mean   :199.2   Mean   :71.95  
##  3rd Qu.: 31.40   3rd Qu.:225.0   3rd Qu.:226.0   3rd Qu.:87.00  
##  Max.   :341.40   Max.   :288.0   Max.   :335.0   Max.   :94.00  
##  NA's   :6                        NA's   :337     NA's   :833    
##    WingPitFat        KeelFat           Crop       
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :1.0000   Median :2.000   Median :0.0000  
##  Mean   :0.7922   Mean   :2.184   Mean   :0.2345  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:0.2500  
##  Max.   :3.0000   Max.   :4.000   Max.   :5.0000  
##  NA's   :831      NA's   :341     NA's   :343
nrow(hawks)
## [1] 908
ncol(hawks)
## [1] 19
dim(hawks)
## [1] 908  19
hawks$CaptureTimeptureTime[!is.character(hawks$CaptureTime)]
## Warning: Unknown or uninitialised column: `CaptureTimeptureTime`.
## NULL

Indexing and subsetting dataframes

# first element in the first column of the data frame
hawks[1, 1]   
## # A tibble: 1 × 1
##   Month
##   <dbl>
## 1     9
# first element in the 6th column
hawks[1, 6]   
## # A tibble: 1 × 1
##   BandNumber
##   <chr>     
## 1 877-76317
# first column of the data frame
hawks[, 1]    
## # A tibble: 908 × 1
##    Month
##    <dbl>
##  1     9
##  2     9
##  3     9
##  4     9
##  5     9
##  6     9
##  7     9
##  8     9
##  9     9
## 10     9
## # ℹ 898 more rows
# first column of the data frame
hawks[1]      
## # A tibble: 908 × 1
##    Month
##    <dbl>
##  1     9
##  2     9
##  3     9
##  4     9
##  5     9
##  6     9
##  7     9
##  8     9
##  9     9
## 10     9
## # ℹ 898 more rows
# first three rows of the 6th column
hawks[1:3, 6] 
## # A tibble: 3 × 1
##   BandNumber
##   <chr>     
## 1 877-76317 
## 2 877-76318 
## 3 877-76319
# the 3rd row of the data frame
hawks[3, ]    
## # A tibble: 1 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    23  1992 12:45          NA       877-76319  RT      I     <NA>    381
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
# equivalent to head_hawks <- head(hawks)
head_hawks <- hawks[1:6, ] 
# dropping column


# select columns by name
hawks$Month[1:6]
## [1] 9 9 9 9 9 9
hawks[,'Month']
## # A tibble: 908 × 1
##    Month
##    <dbl>
##  1     9
##  2     9
##  3     9
##  4     9
##  5     9
##  6     9
##  7     9
##  8     9
##  9     9
## 10     9
## # ℹ 898 more rows
hawks[1,c('Year', "ReleaseTime", 'Sex')]
## # A tibble: 1 × 3
##    Year ReleaseTime Sex  
##   <dbl> <time>      <chr>
## 1  1992    NA       <NA>
sum(!is.na(hawks$Sex))
## [1] 332
hawks_known_sex <- hawks[!is.na(hawks$Sex),]
dim(hawks_known_sex)
## [1] 332  19

Challenge 2.2

Factors

Categorical data

levels(hawks$Sex)
## NULL
hawks$Sex <- factor(x = hawks$Sex, levels = c('M', 'F'), labels = c('Male', 'Female'))
levels(hawks$Sex)
## [1] "Male"   "Female"
summary(hawks$Sex)
##   Male Female   NA's 
##    158    174    576

Challenge 2.3

Plotting

plot(hawks$Sex)

sex <- hawks$Sex
levels(sex)
## [1] "Male"   "Female"
sex <- addNA(sex)
levels(sex)
## [1] "Male"   "Female" NA
levels(sex)[3] <- 'Unknown'
'Unknown1' -> levels(sex)[3] 
plot(sex)

Challenge 2.4

Bonus

x <- 1:10
exp_x <- exp(x)
plot(exp_x, type = 'l', xlab = 'time', ylab = 'progress')
lines(x * 500, col = 'red')

Data manipulation

hawks <- read_csv("data_raw/Hawks.csv") #, col_names = T)
## Rows: 908 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): CaptureTime, BandNumber, Species, Age, Sex
## dbl  (13): Month, Day, Year, Wing, Weight, Culmen, Hallux, Tail, StandardTai...
## time  (1): ReleaseTime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

“I would like to ______ given columns”

select(hawks, Species, Sex, Weight)
## # A tibble: 908 × 3
##    Species Sex   Weight
##    <chr>   <chr>  <dbl>
##  1 RT      <NA>     920
##  2 RT      <NA>     930
##  3 RT      <NA>     990
##  4 CH      F        470
##  5 SS      F        170
##  6 RT      <NA>    1090
##  7 RT      <NA>     960
##  8 RT      <NA>     855
##  9 RT      <NA>    1210
## 10 RT      <NA>    1120
## # ℹ 898 more rows
select(hawks, -BandNumber, -Culmen)
## # A tibble: 908 × 17
##    Month   Day  Year CaptureTime ReleaseTime Species Age   Sex    Wing Weight
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>   <chr> <chr> <dbl>  <dbl>
##  1     9    19  1992 13:30          NA       RT      I     <NA>    385    920
##  2     9    22  1992 10:30          NA       RT      I     <NA>    376    930
##  3     9    23  1992 12:45          NA       RT      I     <NA>    381    990
##  4     9    23  1992 10:50          NA       CH      I     F       265    470
##  5     9    27  1992 11:15          NA       SS      I     F       205    170
##  6     9    28  1992 11:25          NA       RT      I     <NA>    412   1090
##  7     9    28  1992 13:30          NA       RT      I     <NA>    370    960
##  8     9    29  1992 11:45          NA       RT      A     <NA>    375    855
##  9     9    29  1992 15:35          NA       RT      A     <NA>    412   1210
## 10     9    30  1992 13:45          NA       RT      I     <NA>    405   1120
## # ℹ 898 more rows
## # ℹ 7 more variables: Hallux <dbl>, Tail <dbl>, StandardTail <dbl>,
## #   Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>, Crop <dbl>

“I would like to ______ rows that satisfy a condition (e.g., column values equal a given value)”

filter(hawks, Sex == "F")
## # A tibble: 174 × 19
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  2     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  3    10    27  1992 10:05          NA       1253-98802 SS      I     F    
##  4     9    29  1993 10:25          NA       1253-98803 SS      I     F    
##  5    10     1  1993 10:20          NA       745-49512  CH      I     F    
##  6    10    12  1993 13:15          NA       745-49515  CH      I     F    
##  7    10    14  1993 14:05          NA       1373-35272 SS      A     F    
##  8     9     8  1994 12:10          NA       1423-16201 SS      I     F    
##  9     9     9  1994 9:02           NA       2003-58433 SS      I     F    
## 10     9    20  1994 9:05           NA       2003-58435 SS      A     F    
## # ℹ 164 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>

We can also filter rows that do not contain missing data in some columns:

filter(hawks, !is.na(Sex) & !is.na(Weight))
## # A tibble: 327 × 19
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  2     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  3    10    23  1992 16:05          NA       1173-19901 SS      I     M    
##  4    10    27  1992 10:05          NA       1253-98802 SS      I     F    
##  5     9    13  1993 14:25          NA       173-19904  SS      I     M    
##  6     9    17  1993 15:25          NA       193-19905  SS      I     M    
##  7     9    29  1993 10:25          NA       1253-98803 SS      I     F    
##  8    10     1  1993 10:20          NA       745-49512  CH      I     F    
##  9    10     1  1993 10:45          NA       745-49513  CH      A     M    
## 10    10    11  1993 11:35          NA       1173-19906 SS      I     M    
## # ℹ 317 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>

This will return all rows that have a value in both the Sex column and the Weight column. In Tidyverse, there is also a special functions drop_na that can be used to filter out rows with missing data:

drop_na(hawks, Sex, Weight)
## # A tibble: 327 × 19
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  2     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  3    10    23  1992 16:05          NA       1173-19901 SS      I     M    
##  4    10    27  1992 10:05          NA       1253-98802 SS      I     F    
##  5     9    13  1993 14:25          NA       173-19904  SS      I     M    
##  6     9    17  1993 15:25          NA       193-19905  SS      I     M    
##  7     9    29  1993 10:25          NA       1253-98803 SS      I     F    
##  8    10     1  1993 10:20          NA       745-49512  CH      I     F    
##  9    10     1  1993 10:45          NA       745-49513  CH      A     M    
## 10    10    11  1993 11:35          NA       1173-19906 SS      I     M    
## # ℹ 317 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>

Pipes (%>%)

Ctrl + Shift + M Cmd + Shift + M

x %>% f ==> f(x) x %>% f(y) ==> f(x,y)

hawks[1:5,]
## # A tibble: 5 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    19  1992 13:30          NA       877-76317  RT      I     <NA>    385
## 2     9    22  1992 10:30          NA       877-76318  RT      I     <NA>    376
## 3     9    23  1992 12:45          NA       877-76319  RT      I     <NA>    381
## 4     9    23  1992 10:50          NA       745-49508  CH      I     F       265
## 5     9    27  1992 11:15          NA       1253-98801 SS      I     F       205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
head(hawks, n = 5)
## # A tibble: 5 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    19  1992 13:30          NA       877-76317  RT      I     <NA>    385
## 2     9    22  1992 10:30          NA       877-76318  RT      I     <NA>    376
## 3     9    23  1992 12:45          NA       877-76319  RT      I     <NA>    381
## 4     9    23  1992 10:50          NA       745-49508  CH      I     F       265
## 5     9    27  1992 11:15          NA       1253-98801 SS      I     F       205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
hawks %>% head(n = 5)
## # A tibble: 5 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    19  1992 13:30          NA       877-76317  RT      I     <NA>    385
## 2     9    22  1992 10:30          NA       877-76318  RT      I     <NA>    376
## 3     9    23  1992 12:45          NA       877-76319  RT      I     <NA>    381
## 4     9    23  1992 10:50          NA       745-49508  CH      I     F       265
## 5     9    27  1992 11:15          NA       1253-98801 SS      I     F       205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
hawks %>% head(n=5) %>% tail(n=1)
## # A tibble: 1 × 19
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    27  1992 11:15          NA       1253-98801 SS      I     F       205
## # ℹ 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>
hawks %>%
  filter(Sex == "F" | Weight > 500) 
## # A tibble: 741 × 19
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    19  1992 13:30          NA       877-76317  RT      I     <NA> 
##  2     9    22  1992 10:30          NA       877-76318  RT      I     <NA> 
##  3     9    23  1992 12:45          NA       877-76319  RT      I     <NA> 
##  4     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  5     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  6     9    28  1992 11:25          NA       1207-55910 RT      I     <NA> 
##  7     9    28  1992 13:30          NA       877-76320  RT      I     <NA> 
##  8     9    29  1992 11:45          NA       877-76321  RT      A     <NA> 
##  9     9    29  1992 15:35          NA       877-76322  RT      A     <NA> 
## 10     9    30  1992 13:45          NA       1207-55911 RT      I     <NA> 
## # ℹ 731 more rows
## # ℹ 10 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>

Challenge 3.1

“I would like to ______ new columns derived from existing columns (e.g., perform calculations or transformations)”

  • mutate
  • color
  • transpose

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

colnames(hawks)
##  [1] "Month"        "Day"          "Year"         "CaptureTime"  "ReleaseTime" 
##  [6] "BandNumber"   "Species"      "Age"          "Sex"          "Wing"        
## [11] "Weight"       "Culmen"       "Hallux"       "Tail"         "StandardTail"
## [16] "Tarsus"       "WingPitFat"   "KeelFat"      "Crop"
hawks %>%
  mutate(Weight_kg = Weight / 1000)
## # A tibble: 908 × 20
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    19  1992 13:30          NA       877-76317  RT      I     <NA> 
##  2     9    22  1992 10:30          NA       877-76318  RT      I     <NA> 
##  3     9    23  1992 12:45          NA       877-76319  RT      I     <NA> 
##  4     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  5     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  6     9    28  1992 11:25          NA       1207-55910 RT      I     <NA> 
##  7     9    28  1992 13:30          NA       877-76320  RT      I     <NA> 
##  8     9    29  1992 11:45          NA       877-76321  RT      A     <NA> 
##  9     9    29  1992 15:35          NA       877-76322  RT      A     <NA> 
## 10     9    30  1992 13:45          NA       1207-55911 RT      I     <NA> 
## # ℹ 898 more rows
## # ℹ 11 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>, Weight_kg <dbl>

You can also create a second new column based on the first new column within the same call of mutate():

hawks %>%
  mutate(Weight_kg = Weight / 1000,
         Weight_lb = Weight_kg * 2.2)
## # A tibble: 908 × 21
##    Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex  
##    <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr>
##  1     9    19  1992 13:30          NA       877-76317  RT      I     <NA> 
##  2     9    22  1992 10:30          NA       877-76318  RT      I     <NA> 
##  3     9    23  1992 12:45          NA       877-76319  RT      I     <NA> 
##  4     9    23  1992 10:50          NA       745-49508  CH      I     F    
##  5     9    27  1992 11:15          NA       1253-98801 SS      I     F    
##  6     9    28  1992 11:25          NA       1207-55910 RT      I     <NA> 
##  7     9    28  1992 13:30          NA       877-76320  RT      I     <NA> 
##  8     9    29  1992 11:45          NA       877-76321  RT      A     <NA> 
##  9     9    29  1992 15:35          NA       877-76322  RT      A     <NA> 
## 10     9    30  1992 13:45          NA       1207-55911 RT      I     <NA> 
## # ℹ 898 more rows
## # ℹ 12 more variables: Wing <dbl>, Weight <dbl>, Culmen <dbl>, Hallux <dbl>,
## #   Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>,
## #   KeelFat <dbl>, Crop <dbl>, Weight_kg <dbl>, Weight_lb <dbl>

The first few rows of the output are full of NAs, so if we wanted to remove those we could insert a filter() in the chain:

hawks %>%
  filter(!is.na(Weight)) %>%
  mutate(Weight_kg = Weight / 1000) %>%
  head()
## # A tibble: 6 × 20
##   Month   Day  Year CaptureTime ReleaseTime BandNumber Species Age   Sex    Wing
##   <dbl> <dbl> <dbl> <chr>       <time>      <chr>      <chr>   <chr> <chr> <dbl>
## 1     9    19  1992 13:30          NA       877-76317  RT      I     <NA>    385
## 2     9    22  1992 10:30          NA       877-76318  RT      I     <NA>    376
## 3     9    23  1992 12:45          NA       877-76319  RT      I     <NA>    381
## 4     9    23  1992 10:50          NA       745-49508  CH      I     F       265
## 5     9    27  1992 11:15          NA       1253-98801 SS      I     F       205
## 6     9    28  1992 11:25          NA       1207-55910 RT      I     <NA>    412
## # ℹ 10 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, Tail <dbl>,
## #   StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, KeelFat <dbl>,
## #   Crop <dbl>, Weight_kg <dbl>

Challenge 3.2

Split-apply-combine data analysis

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

“I would like to ______ my data into a summary statistic (e.g., mean, sum, count)”

  • delete
  • summarize
  • duplicate
hawks %>% 
  group_by(Sex) %>%
  summarize(mean = mean(Weight, na.rm = TRUE))
## # A tibble: 3 × 2
##   Sex    mean
##   <chr> <dbl>
## 1 F      257.
## 2 M      174.
## 3 <NA>  1090.

You can also group by multiple columns:

hawks %>%
  group_by(Species, Sex) %>%
  summarize(mean = mean(Weight, na.rm = TRUE))
## # A tibble: 9 × 3
## # Groups:   Species [3]
##   Species Sex    mean
##   <chr>   <chr> <dbl>
## 1 CH      F      490.
## 2 CH      M      348.
## 3 CH      <NA>   402 
## 4 RT      F     1147.
## 5 RT      M     1080 
## 6 RT      <NA>  1094.
## 7 SS      F      175.
## 8 SS      M      119.
## 9 SS      <NA>    95

More metrics

hawks %>%
  group_by(Species, Sex) %>%
  summarize(mean = mean(Weight, na.rm = TRUE),
            min = min(Weight, na.rm = TRUE))
## # A tibble: 9 × 4
## # Groups:   Species [3]
##   Species Sex    mean   min
##   <chr>   <chr> <dbl> <dbl>
## 1 CH      F      490.    56
## 2 CH      M      348.   155
## 3 CH      <NA>   402    324
## 4 RT      F     1147.  1120
## 5 RT      M     1080   1080
## 6 RT      <NA>  1094.   101
## 7 SS      F      175.    92
## 8 SS      M      119.    85
## 9 SS      <NA>    95     95

“I would like to ______ rows based on the values in one or more columns (e.g., sort them in ascending or descending order)”

  • copy
  • delete
  • arrange

It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min to put the lowest numbers first:

hawks %>%
  group_by(Species, Sex) %>%
  summarize(mean = mean(Weight, na.rm = TRUE),
            min = min(Weight, na.rm = TRUE)) %>% 
  arrange(min)
## # A tibble: 9 × 4
## # Groups:   Species [3]
##   Species Sex    mean   min
##   <chr>   <chr> <dbl> <dbl>
## 1 CH      F      490.    56
## 2 SS      M      119.    85
## 3 SS      F      175.    92
## 4 SS      <NA>    95     95
## 5 RT      <NA>  1094.   101
## 6 CH      M      348.   155
## 7 CH      <NA>   402    324
## 8 RT      M     1080   1080
## 9 RT      F     1147.  1120
hawks %>%
  group_by(Species, Sex) %>%
  summarize(mean = mean(Weight, na.rm = TRUE),
            min = min(Weight, na.rm = TRUE)) %>% 
  arrange(desc(min))
## # A tibble: 9 × 4
## # Groups:   Species [3]
##   Species Sex    mean   min
##   <chr>   <chr> <dbl> <dbl>
## 1 RT      F     1147.  1120
## 2 RT      M     1080   1080
## 3 CH      <NA>   402    324
## 4 CH      M      348.   155
## 5 RT      <NA>  1094.   101
## 6 SS      <NA>    95     95
## 7 SS      F      175.    92
## 8 SS      M      119.    85
## 9 CH      F      490.    56

“I would like to ______ the number of occurrences for each unique value in a column”

  • reverse
  • count
  • select

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each Sex, we would do:

hawks %>%
  count(Sex) 
## # A tibble: 3 × 2
##   Sex       n
##   <chr> <int>
## 1 F       174
## 2 M       158
## 3 <NA>    576

The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, hawks %>% count(Sex) is equivalent to:

hawks %>%
  group_by(Sex) %>%
  summarize(n = n())
## # A tibble: 3 × 2
##   Sex       n
##   <chr> <int>
## 1 F       174
## 2 M       158
## 3 <NA>    576

We can also combine count() with other functions such as filter(). Here we will count the number of each species with weights above 800 g.

hawks %>%
  filter(Weight > 500) %>%
  count(Species)
## # A tibble: 3 × 2
##   Species     n
##   <chr>   <int>
## 1 CH         19
## 2 RT        566
## 3 SS          2

The example above shows the use of count() to count the number of rows/observations for one factor (i.e., Species). If we wanted to count combination of factors, such as Species and Sex, we would specify the first and the second factor as the arguments of count():

hawks %>%
  filter(Weight > 500) %>%
  count(Species, Sex)
## # A tibble: 6 × 3
##   Species Sex       n
##   <chr>   <chr> <int>
## 1 CH      F        17
## 2 CH      M         2
## 3 RT      F         3
## 4 RT      M         1
## 5 RT      <NA>    562
## 6 SS      M         2

With the above code, we can proceed with arrange() to sort the table according to a number of criteria so that we have a better comparison. For instance, we might want to arrange the table above in (i) an alphabetical order of the levels of the sex and (ii) in descending order of the count:

hawks %>%
  filter(Weight > 500) %>%
  count(Species, Sex) %>% 
  arrange(Sex, desc(n))
## # A tibble: 6 × 3
##   Species Sex       n
##   <chr>   <chr> <int>
## 1 CH      F        17
## 2 RT      F         3
## 3 CH      M         2
## 4 SS      M         2
## 5 RT      M         1
## 6 RT      <NA>    562

Challenge 3.4

Exporting the data

# Filter out observations
hawks_rt_f <- hawks %>% 
  filter(Species == "RT" & Sex == "F")
write_csv(hawks_rt_f, file = "data_processed/Hawks_Red-Tailed_female.csv")